## [1] 6497 14
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "color"
## 'data.frame': 6497 obs. of 14 variables:
## $ X : int 4898 4895 4894 4893 4892 4891 4889 4883 4882 4880 ...
## $ fixed.acidity : num 6 6.6 6.2 6.5 5.7 6.1 6.8 5.5 5 6.6 ...
## $ volatile.acidity : num 0.21 0.32 0.21 0.23 0.21 0.34 0.22 0.32 0.235 0.34 ...
## $ citric.acid : num 0.38 0.36 0.29 0.38 0.32 0.29 0.36 0.13 0.27 0.4 ...
## $ residual.sugar : num 0.8 8 1.6 1.3 0.9 ...
## $ chlorides : num 0.02 0.047 0.039 0.032 0.038 0.036 0.052 0.037 0.03 0.046 ...
## $ free.sulfur.dioxide : num 22 57 24 29 38 25 38 45 34 68 ...
## $ total.sulfur.dioxide: num 98 168 92 112 121 100 127 156 118 170 ...
## $ density : num 0.989 0.995 0.991 0.993 0.991 ...
## $ pH : num 3.26 3.15 3.27 3.29 3.24 3.06 3.04 3.26 3.07 3.15 ...
## $ sulphates : num 0.32 0.46 0.5 0.54 0.46 0.44 0.54 0.38 0.5 0.5 ...
## $ alcohol : num 11.8 9.6 11.2 9.7 10.6 ...
## $ quality : int 6 5 6 5 6 6 5 5 6 6 ...
## $ color : chr "White" "White" "White" "White" ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## alcohol quality color
## Min. : 8.00 Min. :3.000 Length:6497
## 1st Qu.: 9.50 1st Qu.:5.000 Class :character
## Median :10.30 Median :6.000 Mode :character
## Mean :10.49 Mean :5.818
## 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :14.90 Max. :9.000
Wine quality is roughly normally distributed with a slight skew. Most wines were rated 5 or 6; the lowest rating is 3 and the highest is 9. Next, we plot the distribution of each individual factor and look for potential relationships.
At first glance, the following factors are approximately normally distributed:
1. fixed.acidity
2. volatile.acidity
3. density
4. pH
The following factors have a slightly skewed distribution, more like the quality distribution:
1. citric.acid
2. residual.sugar
3. chlorides
4. free.sulfur.dioxide
5. total.sulfur.dioxide
6. sulphates
7. alcohol
Based on the variable descriptions, the 11 chemical factors can be grouped as follows:
1. Acids
2. Sugar
3. Alcohol
4. Chlorides
5. Sulphates
We will mainly examine these 5 groups and their relationship to quality.
There are 6497 observations of 14 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality, color). Quality is an ordered, categorical, discrete variable, rated on a 0-10 scale by at least 3 wine experts. The observed values range only from 3 to 9, with a mean of 5.818 and a median of 6. X is the index number of each wine sample. Color is a categorical variable that was created for this analysis. All other variables are quantitative measurements of the chemical content of the wine.
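Because quality is ordinal, it can help to store it as an ordered factor before tabulating or plotting it. The following is a minimal sketch only: `wines` is an assumed name for the data frame summarised above, and `quality_counts` is a hypothetical helper, not part of the original analysis.

```r
# Sketch: treat quality as an ordered factor and count each rating.
# `quality_counts` is a hypothetical helper; `wines` (used below) is an
# assumed name for the data frame shown in the output above.
quality_counts <- function(df) {
  # Levels 0..10 cover the whole rating scale, so ratings that never
  # occur still appear in the table with a count of 0.
  q <- factor(df$quality, levels = 0:10, ordered = TRUE)
  table(q)
}

# Usage, assuming the data has already been read into `wines`:
# quality_counts(wines)
# barplot(quality_counts(wines), xlab = "quality", ylab = "count")
```

Keeping the unused levels makes it visible in the table that no wine was rated below 3 or above 9.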
The main feature of interest is which factors affect the quality of red and white wine. I suspect that alcohol, residual.sugar and pH affect quality. A second point of interest is the difference between red and white wine.
From the variable descriptions, the pairs fixed.acidity and volatile.acidity, free.sulfur.dioxide and total.sulfur.dioxide, and alcohol and density are likely to be correlated.
Yes, ‘color’ is a newly created variable that labels each sample as red or white.
Factors such as residual.sugar and free.sulfur.dioxide have pronounced outliers. However, considering the units used, the outliers are acceptable, and the data is tidy.
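One common way to quantify "outliers" is Tukey's 1.5 × IQR rule. The analysis does not say which rule was used, so the sketch below is an assumption, and `count_iqr_outliers` is a hypothetical helper. For residual.sugar, the summary above gives Q1 = 1.8 and Q3 = 8.1, so the upper fence would be 8.1 + 1.5 × 6.3 = 17.55, well below the maximum of 65.8.

```r
# Sketch: count values outside Tukey's fences
# (Q1 - k * IQR, Q3 + k * IQR). The choice of this rule and k = 1.5
# is an assumption, not something stated in the analysis.
count_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  sum(x < q[1] - k * spread | x > q[2] + k * spread, na.rm = TRUE)
}

# Usage, assuming the data frame is called `wines`:
# sapply(wines[c("residual.sugar", "free.sulfur.dioxide")], count_iqr_outliers)
```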
Density is negatively correlated with alcohol and positively correlated with residual sugar; the correlation coefficients are -0.687 and 0.553 respectively.
White wine tends to have more alcohol and residual sugar, and less acid, sulphates and chlorides.
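A comparison like this can be checked with per-color means. The sketch below uses base R's `aggregate`; `mean_by_color` is a hypothetical helper and `wines` an assumed data frame name, and the mean is only one reasonable summary (the median would be more robust to the outliers noted earlier).

```r
# Sketch: mean of selected variables for each wine color.
# `mean_by_color` is a hypothetical helper name.
mean_by_color <- function(df, vars) {
  aggregate(df[vars], by = list(color = df$color), FUN = mean)
}

# Usage, assuming the data frame is called `wines`:
# mean_by_color(wines, c("alcohol", "residual.sugar", "fixed.acidity",
#                        "sulphates", "chlorides"))
```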
As assumed in section 1, there are some intrinsic relationships between the variables. For example, free.sulfur.dioxide and total.sulfur.dioxide are positively related to each other, and pH is negatively related to the acids.
The strongest relationship is between density and alcohol (R = -0.687), which makes sense because alcohol is less dense than water (49.3 lb/ft^3 versus 62.4 lb/ft^3).
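Coefficients such as the one quoted here are Pearson correlations, which base R computes with `cor`. The sketch below is illustrative only: `pearson_pairs` and `pairs_of_interest` are hypothetical names, and `wines` is an assumed name for the data frame.

```r
# Sketch: Pearson correlation for a list of variable pairs.
# `pearson_pairs` is a hypothetical helper name.
pearson_pairs <- function(df, pairs) {
  sapply(pairs, function(p) {
    # complete.obs drops rows where either variable is missing
    cor(df[[p[1]]], df[[p[2]]], use = "complete.obs")
  })
}

# Pairs discussed in the text (the list name is an assumption).
pairs_of_interest <- list(
  density_alcohol = c("density", "alcohol"),
  density_sugar   = c("density", "residual.sugar"),
  free_total_so2  = c("free.sulfur.dioxide", "total.sulfur.dioxide")
)

# Usage, assuming the data frame is called `wines`:
# round(pearson_pairs(wines, pairs_of_interest), 3)
```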
## Warning: Removed 1204 rows containing non-finite values (stat_smooth).
## Warning: Removed 1214 rows containing missing values (geom_point).
In this section, I found that quality is related to alcohol, residual sugar, sulphates, chlorides, and acids.
The standards used to judge the quality of red wine and white wine are different. For red wine, both residual sugar and acidity have a positive relationship with quality, whereas for white wine both factors are negatively related to quality. Sulphates have a positive effect in red wine, but white wine is not sensitive to them. Chlorides have a negative effect on both red and white wine, although there is a significant number of outliers.
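The direction of each relationship can be checked separately per color. The sketch below uses Spearman rank correlation because quality is ordinal; that choice is an assumption and not necessarily what produced the findings above. `cor_with_quality_by_color` is a hypothetical helper and `wines` an assumed data frame name.

```r
# Sketch: rank correlation between quality and each variable,
# computed separately for each wine color.
# Returns a matrix with one row per variable and one column per color
# when two or more variables are passed.
cor_with_quality_by_color <- function(df, vars) {
  sapply(split(df, df$color), function(d) {
    sapply(vars, function(v) cor(d$quality, d[[v]], method = "spearman"))
  })
}

# Usage, assuming the data frame is called `wines`:
# cor_with_quality_by_color(wines, c("residual.sugar", "fixed.acidity",
#                                    "sulphates", "chlorides", "alcohol"))
```

A positive entry in one column and a negative entry in the other for the same variable would correspond to the opposite effects described for red and white wine.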
Wine quality is roughly normally distributed with a slight skew. Most wines were rated 5 or 6; the lowest rating is 3 and the highest is 9.
This plot shows the difference between red and white wine: red wine has more acids, sulphates and chlorides, less sugar, and slightly less alcohol.
The relationship between alcohol content and quality follows a similar trend for both red and white wine. Wines rated 5 have the lowest alcohol content. Overall, alcohol is positively related to wine quality.
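The trend described here can be tabulated as the typical alcohol content for each quality rating and color. This is a sketch under assumptions: the median is my choice of summary, `alcohol_by_quality` is a hypothetical helper, and `wines` is an assumed data frame name.

```r
# Sketch: median alcohol content for each (color, quality) combination.
# `alcohol_by_quality` is a hypothetical helper name.
alcohol_by_quality <- function(df) {
  aggregate(alcohol ~ color + quality, data = df, FUN = median)
}

# Usage, assuming the data frame is called `wines`:
# alcohol_by_quality(wines)
```

Reading the result down the quality column for one color at a time shows whether the rating with the lowest median alcohol is 5 and whether the medians rise toward the higher ratings.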
“The biggest difference between reds and whites is in how they’re made. The grapes used for red and white wines generally look very different—as you might imagine, red wine grapes are darker and have more pigment. When making white wine, typically the grapes are pressed and then just the juice is fermented.”1
The nature of the grapes and the winemaking process make the telling difference. Using the data, we examined the differences between red and white wine in terms of their chemical content. Compared to red wine, white wine tends to have higher alcohol, more residual sugar, and less acid, sulphates and chlorides (probably because of the winemaking process).
Some factors affecting quality also differ between red and white wine. Residual sugar and acids contribute positively to the quality of red wine but lower the rating of white wine. Sulphates positively influence red wine quality, but white wine seems to be insensitive to them. For both wines, higher alcohol content tends to be rated as better quality.
After all, a quality rating is a relatively subjective measure. Human beings, even experts, have limits in distinguishing the tiny differences between samples, not to mention consumers. That is probably why most wines were rated 5 or 6. If more extreme cases (below 3 or greater than 8) could be gathered, I would be interested to see why those samples stand out.
Reference: 1. http://www.winespectator.com/drvinny/show/id/44697